Now that we've got the basic package and a dataset up, let's dive into the details of what's going on here. The goal is to give you enough of a foundation in network analysis to see how it can help answer research questions. It's also worth mentioning that in mathematics networks are often called graphs, and fit under the umbrella term graph theory, hence the package name igraph. I'll use the terms graph and network interchangeably, while a visualisation of a network I'll try to call a plot (apologies in advance if I call a visualisation a graph!).
The basic components of networks aka graphs are nodes and edges.
In the dataset we're using, Star Wars characters are nodes, also known as vertices (or a vertex if singular). This is analogous to records in classic datasets, such as people or firms, which can include characteristics such as age or income. Nodes can have characteristics too, usually called node attributes, and these can help in analysing and understanding the network.
Let's take a deeper look at the data we used to create the plot above:
nodes
As mentioned in a previous section, there are two components of data for each of these nodes: a name and an id.
To see how that shows up when loaded into an igraph network use the V (vertices, synonymous with nodes) function:
V(g)
## + 22/22 vertices, named, from d36051a:
## [1] R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER CAMIE
## [7] BIGGS LEIA BERU OWEN OBI-WAN MOTTI
## [13] TARKIN HAN GREEDO JABBA DODONNA GOLD LEADER
## [19] WEDGE RED LEADER RED TEN GOLD FIVE
This lists the vertices/nodes, how many there are, and a short id for the graph they come from (the code after "from"). The name attribute is particularly helpful in graph visualisation as it shows up automatically in igraph plots. If you've not used R before, here's a useful way to look at parts of data (usually columns in a table): use the $ symbol after the variable name, followed by the column name (as with the nodes variable) or the attribute name (as with the g network variable).
nodes$name
V(g)$name
We also get a list of all attributes with
vertex_attr(g)
## $name
## [1] "R2-D2" "CHEWBACCA" "C-3PO" "LUKE" "DARTH VADER"
## [6] "CAMIE" "BIGGS" "LEIA" "BERU" "OWEN"
## [11] "OBI-WAN" "MOTTI" "TARKIN" "HAN" "GREEDO"
## [16] "JABBA" "DODONNA" "GOLD LEADER" "WEDGE" "RED LEADER"
## [21] "RED TEN" "GOLD FIVE"
##
## $id
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Let's add another attribute. Following the classic notion of the Force in Star Wars, we can add data indicating which side of the Force characters are associated with (or other if neither). First we create vectors of names using the c (combine) function to get a vector of strings (names of nodes in this case).
# Create a vector of character names for each side
dark_side <- c("DARTH VADER", "MOTTI", "TARKIN")
light_side <- c("R2-D2", "CHEWBACCA", "C-3PO", "LUKE", "CAMIE", "BIGGS",
"LEIA", "BERU", "OWEN", "OBI-WAN", "HAN", "DODONNA",
"GOLD LEADER", "WEDGE", "RED LEADER", "RED TEN", "GOLD FIVE")
neutral <- c("GREEDO", "JABBA")
Now we can add colours to the nodes based on the vectors of names we just created. R has a set of colour names that can be used in plots; feel free to pick your own. By storing one of these colour names in the new color attribute, we can have that directly show up in the visualisation. Note the spelling of color: if you add to a colour column the data is stored but may not automatically affect the plot.
To add the color attribute, first create the column with an NA (refers to 'not available') value
# Add the color attribute to the network nodes
V(g)$color <- NA # Initialise the new 'color' attribute as NA for all nodes
then fill it up using the categories saved above. If you also run V(g)$color in between adding these you'll see how the $color column gets populated
V(g)$color[V(g)$name %in% dark_side] <- "red" # set the dark side color name to red
V(g)$color
## [1] NA NA NA NA "red" NA NA NA NA NA NA "red"
## [13] "red" NA NA NA NA NA NA NA NA NA
To break this down: the %in% operation tests if data in one column---in this case the $name variable which is to the left of %in%---matches with (is in) the options on the right of the %in%. Where it does match it returns TRUE, and elsewhere it returns FALSE.
V(g)$color[V(g)$name %in% light_side] <- "gold" # set the light side color name to gold
V(g)$color[V(g)$name %in% neutral] <- "green" # set the color of neutral characters to green
V(g)$color
## [1] "gold" "gold" "gold" "gold" "red" "gold" "gold" "gold" "gold"
## [10] "gold" "gold" "red" "red" "gold" "green" "green" "gold" "gold"
## [19] "gold" "gold" "gold" "gold"
In this case that means the intended colour value gets saved in the places where the names match the dark, light and neutral sides. It's complicated to explain but extremely useful.
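To see the %in% pattern on its own, here is a minimal sketch in plain R with no network involved. The node_names and node_color vectors are made-up stand-ins for V(g)$name and V(g)$color, so treat this as an illustration of the idea rather than part of the handout's data.

```r
# Toy stand-ins for V(g)$name and the dark_side vector (illustrative values only)
node_names <- c("LUKE", "DARTH VADER", "LEIA", "TARKIN")
dark_side  <- c("DARTH VADER", "MOTTI", "TARKIN")

# %in% returns one TRUE/FALSE per element on its left-hand side
is_dark <- node_names %in% dark_side
print(is_dark)

# Start with NA everywhere, then overwrite only the positions where the test is TRUE
node_color <- rep(NA_character_, length(node_names))
node_color[is_dark] <- "red"
print(node_color)
```

Note that MOTTI is in dark_side but not in node_names, and that causes no error: only the left-hand side decides how long the result is.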
These attributes can help us look at subgraphs: portions of the graph such as just the dark_side:
dark_side_graph <- induced_subgraph(g, dark_side) # Using the dark_side variable from above
V(dark_side_graph)
## + 3/3 vertices, named, from 923f46e:
## [1] DARTH VADER MOTTI TARKIN
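It can help to see what induced_subgraph drops as well as what it keeps. The sketch below uses a hypothetical four-node network (the names A to D and the toy_edges data are mine, not from the Star Wars files) and keeps three of the nodes.

```r
library(igraph)

# Hypothetical toy edge list: a triangle A-B-C plus a node D attached only to C
toy_edges <- data.frame(source = c("A", "A", "B", "C"),
                        target = c("B", "C", "C", "D"))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

# Keep only A, B and D: every edge that touched C disappears along with C
sub <- induced_subgraph(toy, c("A", "B", "D"))

vcount(sub)  # number of nodes kept
ecount(sub)  # number of edges kept (only A-B survives)
```

In the subgraph D looks completely isolated, even though it had a connection in the full network. That is a small version of the sampling issue discussed next.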
This raises an important point: so many aspects of the world can be thought of as a network, and just as populations are sampled to make summary claims (such as income distribution or age), networks are often sampled for analysis. And just as we need to be up front about sampling methods in many research contexts, we need to be aware that often there are portions of networks we cannot observe. Those sections may be very important, and in failing to observe them we can end up with very different structures and very different results.
In the classic sense of probability theory the law of large numbers suggests that often 1000 trials of an experiment, or random sampling from a population (often needing weighting to be accurate), can lead to representative results for the whole population. This can unfortunately be very difficult to manage in the case of networks (Browne 2005).
So: often quantitative analysis of networks involves subgraphs, and it's worth being aware of that when analysing. Keep that in mind: we'll come back to this.
It's worth acknowledging that we're focusing on specific attributes that help with visualisation. I got frustrated preparing some of these slides because I tried doing something like this but with variations in spelling...
# load the data to a new variable (f) in the same way we loaded for g prior to adding colour
f <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
# Add a colour attribute using the British rather than American spelling...
V(f)$colour <- NA # Initialise the new 'colour' attribute as NA for all nodes
V(f)$colour[V(f)$name %in% dark_side] <- "red" # set the dark side colour name to red
V(f)$colour[V(f)$name %in% light_side] <- "gold" # set the light side colour name to gold
V(f)$colour[V(f)$name %in% neutral] <- "green" # set the colour of neutral characters to green
plot(f)
Notice no difference in colour. Now compare that with
plot(g)
This should illustrate two things:
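Attributes can be added to the nodes of a network much like columns in a data.frame like the nodes variable
The spelling matters: only an attribute named color in igraph automatically affects the plot
Similarly: you can visualise specific sections of a graph, which can help by providing better detail and an easier read. Returning to the dark_side_graph subsample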
plot(dark_side_graph)
This is a lot easier to read and can, as a way of zooming in, give us a clearer picture of some aspects of the structure. This becomes more crucial with much larger network datasets. It's already difficult to read the names of characters in these plots. Let's compare it with the other side of the force:
light_side_graph <- induced_subgraph(g, light_side) # Using the light_side variable
plot(light_side_graph)
Interesting and detailed by itself, but again worth being aware that it can make a significant difference comparing this with the rest of the network, and potentially misleading without acknowledging the network it's sampled from.
Like many other quantitative methodologies there are many other types of attributes that we can apply to nodes such as
We'll return to the applicability of these later but generally: most variables that can be used in classic statistical analysis can be applied in network analysis. It might be hard... but that data can generally be useful.
To close: be wary of sampling issues in network analysis!
Edges, also called links and ties, are the connections in networks/graphs. They can be friendship, kinship, contracts, following, liking, debt, etc. Our conversation right now is via a digital network, but if we were in a lab there would be conversations face to face, just with a lot more physical movement and scribbling on a white board.
To get started let's look at the second data file we loaded in the beginning
head(edges)
## source target weight
## 1 C-3PO R2-D2 17
## 2 LUKE R2-D2 13
## 3 OBI-WAN R2-D2 6
## 4 LEIA R2-D2 5
## 5 HAN R2-D2 5
## 6 CHEWBACCA R2-D2 3
The head and tail functions are very handy ways to peek at datasets, especially very large ones. By default they return 6 records from the start or end of a data.frame respectively.
edges has three variables, the first two of which specify edges and the last is an attribute. We'll look at these in turn
Edges can be categorised as directed or undirected. So far, we've been working with an undirected graph, and that's why we include the directed=FALSE parameter in creating g: g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE).
But it's no accident that the first two column names of the edges file are source and target. Different network packages follow different conventions (see the appendix for other packages) but directional information can be very important in networks. Someone can like someone else's tweet, just as one company can offer a contract to another or academic papers (hopefully) get cited in other papers. It's important in these cases to specify the direction of the connection, and some of the first social network analysis research (with diagrams then called sociograms (Festinger 1954)) used directional information:
Sociogram
In this study, people were individually asked to name friends at school. This diagram of friendship in a 4th grade class (US term for UK year 5) in the 1930s is a famous demonstration of a case where friendship is highly correlated with gender. If you look closely (try ctrl/cmd + to zoom in) you'll see arrows pointing at circles, such as from EL to SH, while in a few other cases, like the connection between BR and MC, there are no arrows.
This means that EL named SH as a friend but SH didn't reciprocate (also name EL as a friend), while BR and MC both named each other as friends.
Here's a newer diagram of the same network which is a bit easier to read from http://www.martingrandjean.ch/social-network-analysis-visualization-morenos-sociograms-revisited/:
Directed Network
With the additional information of the shapes in the first diagram, which map to colours in the second, this is a way of demonstrating the separation of social groups by gender and also how those groups are very weakly tied together. Weak in as much as only one connection is named across gender and it is not reciprocated. There are ways of quantifying how different groups are connected in networks, which is outside the scope of the session today, but there is lots of research on this in many other types of network analysis, including social (Leskovec, Lang, and Mahoney 2010).
To summarise: undirected networks have two basic states between nodes
Undirected Networks
while directed networks have 3 basic states between nodes
Directed Networks
And returning to the way we constructed g originally:
g <- graph_from_data_frame(d=edges, vertices=nodes, directed=FALSE)
we can also construct that network as directed by leaving out directed=FALSE because the default value is directed=TRUE
d <- graph_from_data_frame(d=edges, vertices=nodes, directed=TRUE)
plot(d)
and just to help you remember, this is equivalent to the default state without including the directed parameter (I've forgotten this myself many times)
d <- graph_from_data_frame(d=edges, vertices=nodes)
plot(d)
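If you want to check which kind of network you've ended up with, is_directed will tell you. This is a small sketch on a hypothetical two-row edge list (toy_edges is made up, though it uses the same source and target column names as edges).

```r
library(igraph)

# Hypothetical two-row edge list using the same column names as edges
toy_edges <- data.frame(source = c("A", "B"),
                        target = c("B", "C"))

default_graph    <- graph_from_data_frame(toy_edges)                    # no directed argument given
undirected_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

is_directed(default_graph)     # TRUE: the source -> target order is kept
is_directed(undirected_graph)  # FALSE: A--B is treated the same as B--A
```

Running is_directed straight after building a network is a cheap way to catch the forgotten-parameter mistake mentioned above.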
Just like nodes, edges can have attributes. The directionality in the previous section is an example of information associated with a tie, and the presence or absence of one. One of the most common examples of tie attributes is included in the star wars dataset, and is usually described as a weighted tie. This is the third column in edges labeled weight
head(edges)
## source target weight
## 1 C-3PO R2-D2 17
## 2 LUKE R2-D2 13
## 3 OBI-WAN R2-D2 6
## 4 LEIA R2-D2 5
## 5 HAN R2-D2 5
## 6 CHEWBACCA R2-D2 3
This is the number of scenes that both characters share in the film. If combined with the source and target information then it's a means of weighting directed ties/edges.
With all this in mind, we can look at the original Star Wars network for a glimpse of what's included in the whole structure:
g
## IGRAPH 9b2b514 UNW- 22 60 --
## + attr: name (v/c), id (v/n), weight (e/n)
## + edges from 9b2b514 (vertex names):
## [1] R2-D2 --C-3PO R2-D2 --LUKE R2-D2 --OBI-WAN
## [4] R2-D2 --LEIA R2-D2 --HAN R2-D2 --CHEWBACCA
## [7] R2-D2 --DODONNA CHEWBACCA --OBI-WAN CHEWBACCA --C-3PO
## [10] CHEWBACCA --LUKE CHEWBACCA --HAN CHEWBACCA --LEIA
## [13] CHEWBACCA --DARTH VADER CHEWBACCA --DODONNA LUKE --CAMIE
## [16] CAMIE --BIGGS LUKE --BIGGS DARTH VADER--LEIA
## [19] LUKE --BERU BERU --OWEN C-3PO --BERU
## [22] LUKE --OWEN C-3PO --LUKE C-3PO --OWEN
## + ... omitted several edges
It's a bit technical, but what's shown is a summary of the network object:
U means undirected
N means named graph (hence the name attribute)
W means weighted graph (hence the weight attribute)
22 is the number of nodes
60 is the number of edges
name (v/c) means name is a node attribute and it's a character (aka a string, or list of characters)
weight (e/n) means weight is an edge attribute and it's numeric
The rows indicate connections between nodes, so record [1] is between R2-D2 and C-3PO. Also note for those of you used to python: R is 1 indexed rather than 0 indexed...
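Each piece of that summary line can also be asked for directly, which is handy when you'd rather not decode the letters. A minimal sketch, assuming a made-up three-edge network called toy rather than the Star Wars g:

```r
library(igraph)

# Hypothetical weighted edge list in the same shape as edges
toy_edges <- data.frame(source = c("A", "A", "B"),
                        target = c("B", "C", "C"),
                        weight = c(3, 1, 2))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

is_directed(toy)  # FALSE, which is the "U" in the summary
is_named(toy)     # TRUE, the "N": nodes carry a name attribute
is_weighted(toy)  # TRUE, the "W": edges carry a weight attribute
vcount(toy)       # number of nodes
ecount(toy)       # number of edges
```

The same five calls on g should line up with the UNW- 22 60 header shown above.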
Similar to the vertices V() function there is an edge function E() which prints the connections section of the summary of g just described.
E(g)
## + 60/60 edges from 9b2b514 (vertex names):
## [1] R2-D2 --C-3PO R2-D2 --LUKE R2-D2 --OBI-WAN
## [4] R2-D2 --LEIA R2-D2 --HAN R2-D2 --CHEWBACCA
## [7] R2-D2 --DODONNA CHEWBACCA --OBI-WAN CHEWBACCA --C-3PO
## [10] CHEWBACCA --LUKE CHEWBACCA --HAN CHEWBACCA --LEIA
## [13] CHEWBACCA --DARTH VADER CHEWBACCA --DODONNA LUKE --CAMIE
## [16] CAMIE --BIGGS LUKE --BIGGS DARTH VADER--LEIA
## [19] LUKE --BERU BERU --OWEN C-3PO --BERU
## [22] LUKE --OWEN C-3PO --LUKE C-3PO --OWEN
## [25] C-3PO --LEIA LUKE --LEIA LEIA --BERU
## [28] LUKE --OBI-WAN C-3PO --OBI-WAN LEIA --OBI-WAN
## + ... omitted several edges
As with nodes, edge attributes are accessed by $:
E(g)$weight
## [1] 17 13 6 5 5 3 1 7 5 16 19 11 1 1 2 2 4 1 3 3 2 3 18 2 6
## [26] 17 1 19 6 1 2 1 7 9 26 1 1 6 1 1 13 1 1 1 1 1 1 2 1 1
## [51] 3 3 1 1 3 1 2 1 1 1
and the list of attributes is printed by the edge_attr function
edge_attr(g)
## $weight
## [1] 17 13 6 5 5 3 1 7 5 16 19 11 1 1 2 2 4 1 3 3 2 3 18 2 6
## [26] 17 1 19 6 1 2 1 7 9 26 1 1 6 1 1 13 1 1 1 1 1 1 2 1 1
## [51] 3 3 1 1 3 1 2 1 1 1
I realise this is going very quickly but the point is to demonstrate how similar accessing edge attribute information is to accessing node attributes.
With this demonstrated, we can now try adding another edge attribute. This is trickier than adding node attributes because there are so many edges (and this gets more complicated depending on whether directed or undirected) but similar to the method we used before:
E(g)$color <- "blue"
E(g)$color[E(g)$weight >= 5] <- "red"
This is a very simple case where we're colouring edges based on weight, once again using the standard colour names that come with R. We've set the color criterion to pick out edges with a weight of five or more (a minority of the edges).
Once again using the edge_attr function we can see what's been added:
edge_attr(g)
## $weight
## [1] 17 13 6 5 5 3 1 7 5 16 19 11 1 1 2 2 4 1 3 3 2 3 18 2 6
## [26] 17 1 19 6 1 2 1 7 9 26 1 1 6 1 1 13 1 1 1 1 1 1 2 1 1
## [51] 3 3 1 1 3 1 2 1 1 1
##
## $color
## [1] "red" "red" "red" "red" "red" "blue" "blue" "red" "red" "red"
## [11] "red" "red" "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue"
## [21] "blue" "blue" "red" "blue" "red" "red" "blue" "red" "red" "blue"
## [31] "blue" "blue" "red" "red" "red" "blue" "blue" "red" "blue" "blue"
## [41] "red" "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue"
## [51] "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue" "blue"
And similar to node attributes, color has the extra significance of automatically being added to plots (though as only a minority of edges are red it might not show up obviously on your screen).
plot(g)
Hope it's clear on some screens. We've obviously got issues with positioning which is to come. We're very close to finishing the edge section. We've got one more part to cover then on to the details of visualisation.
But before we do, once again we can use this visualisation (and the weight attribute) to answer a question: which pair of characters share the most scenes in Star Wars Episode IV?
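Questions like this can also be checked in code rather than by eye. The sketch below shows one possible approach on a hypothetical three-edge network (toy and its weights are made up, so it doesn't give away the Star Wars answer): find the position of the largest weight, then ask igraph which nodes sit at the ends of that edge.

```r
library(igraph)

# Hypothetical weighted edge list in the same shape as edges
toy_edges <- data.frame(source = c("A", "A", "B"),
                        target = c("B", "C", "C"),
                        weight = c(3, 7, 2))
toy <- graph_from_data_frame(toy_edges, directed = FALSE)

heaviest <- which.max(E(toy)$weight)       # position of the largest weight
heaviest_pair <- ends(toy, E(toy)[heaviest])  # node names at either end of that edge
heaviest_pair
```

One caveat: which.max only returns the first position if several edges tie for the largest weight.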
So far the edges have all been represented as lists of pairs of node names, but another way of representing edges is via a matrix, and that features a lot in other network packages. I'm just going to briefly show you that now
g[]
## 22 x 22 sparse Matrix of class "dgCMatrix"
## [[ suppressing 22 column names 'R2-D2', 'CHEWBACCA', 'C-3PO' ... ]]
##
## R2-D2 . 3 17 13 . . . 5 . . 6 . . 5 . . 1 . . . . .
## CHEWBACCA 3 . 5 16 1 . . 11 . . 7 . . 19 . . 1 . . . . .
## C-3PO 17 5 . 18 . . 1 6 2 2 6 . . 6 . . . . . 1 . .
## LUKE 13 16 18 . . 2 4 17 3 3 19 . . 26 . . 1 1 2 3 1 .
## DARTH VADER . 1 . . . . . 1 . . 1 1 7 . . . . . . . . .
## CAMIE . . . 2 . . 2 . . . . . . . . . . . . . . .
## BIGGS . . 1 4 . 2 . 1 . . . . . . . . . 1 2 3 . .
## LEIA 5 11 6 17 1 . 1 . 1 . 1 1 1 13 . . . . . 1 . .
## BERU . . 2 3 . . . 1 . 3 . . . . . . . . . . . .
## OWEN . . 2 3 . . . . 3 . . . . . . . . . . . . .
## OBI-WAN 6 7 6 19 1 . . 1 . . . . . 9 . . . . . . . .
## MOTTI . . . . 1 . . 1 . . . . 2 . . . . . . . . .
## TARKIN . . . . 7 . . 1 . . . 2 . . . . . . . . . .
## HAN 5 19 6 26 . . . 13 . . 9 . . . 1 1 . . . . . .
## GREEDO . . . . . . . . . . . . . 1 . . . . . . . .
## JABBA . . . . . . . . . . . . . 1 . . . . . . . .
## DODONNA 1 1 . 1 . . . . . . . . . . . . . 1 1 . . .
## GOLD LEADER . . . 1 . . 1 . . . . . . . . . 1 . 1 1 . .
## WEDGE . . . 2 . . 2 . . . . . . . . . 1 1 . 3 . .
## RED LEADER . . 1 3 . . 3 1 . . . . . . . . . 1 3 . 1 .
## RED TEN . . . 1 . . . . . . . . . . . . . . . 1 . .
## GOLD FIVE . . . . . . . . . . . . . . . . . . . . . .
For those of you interested in the details, the matrix is symmetrical if undirected and potentially asymmetrical if directed. Each row and column lists the connections between one node and all the others (including itself, which is the diagonal down the middle). I'm going to leave that there for today, but this should give you some idea of where it comes from and what it means if you see a package asking for an adjacency matrix.
Note: there aren't any cases of links to oneself in the Star Wars example (which wouldn't make sense in a film... except for maybe a sci-fi time travel one) but in other cases that can be clearer, such as emailing yourself (something I do all the time as reminders).
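To make the symmetry point concrete, here is a small sketch that builds the same hypothetical edge list twice, once undirected and once directed, and compares the two adjacency matrices. The toy_edges data is made up; as_adjacency_matrix is the igraph function doing roughly what g[] shows above, and sparse = FALSE is my choice here so the result is an ordinary R matrix.

```r
library(igraph)

# Hypothetical weighted edge list in the same shape as edges
toy_edges <- data.frame(source = c("A", "A", "B"),
                        target = c("B", "C", "C"),
                        weight = c(3, 7, 2))

undirected <- graph_from_data_frame(toy_edges, directed = FALSE)
directed   <- graph_from_data_frame(toy_edges, directed = TRUE)

# attr = "weight" fills the cells with edge weights instead of 0/1
m_undirected <- as_adjacency_matrix(undirected, attr = "weight", sparse = FALSE)
m_directed   <- as_adjacency_matrix(directed,   attr = "weight", sparse = FALSE)

m_undirected
m_directed

isSymmetric(m_undirected)  # the undirected matrix mirrors across the diagonal
isSymmetric(m_directed)    # the directed one only has entries from source row to target column
```

In the directed version the row is the source and the column is the target, so the weight of 3 appears in row A, column B but not in row B, column A.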
Finally down to visualisation! We've obviously spent almost all this time on constructing the graph and understanding the data that composes it. But what's really annoying throughout (and generally a very hard problem, for which there are many packages) is a good visual presentation of a network. There's lots of work on this, but to just give you a taste we're going to dive into the plot function and the layout parameter, keeping the edge highlighting we've been working on
par(mfrow=c(2, 3), mar=c(0,0,1,0))
plot(g, layout=layout_randomly, main="Random")
plot(g, layout=layout_in_circle, main="Circle")
plot(g, layout=layout_as_star, main="Star")
plot(g, layout=layout_as_tree, main="Tree")
plot(g, layout=layout_on_grid, main="Grid")
plot(g, layout=layout_with_fr, main="Force-directed")
These are just some of the layout options to choose from. layout_with_fr is a favorite of many, so that's what I used at the start. So to get closer to what I showed at the start:
plot(g, layout=layout_with_fr, main="Force-directed")
Note that the size of the plot for the grid was specified, but for the individual example I've left it blank (default). The layout is based on a random number generator, so everyone's will look a bit different. We can reproduce the same arrangement by specifying a seed (coming up below).
Bringing the node colouring back in, we can make the most of node and edge attributes.
V(g)$color <- NA
V(g)$color[V(g)$name %in% dark_side] <- "red"
V(g)$color[V(g)$name %in% light_side] <- "gold"
V(g)$color[V(g)$name %in% neutral] <- "green"
par(mfrow=c(2, 3), mar=c(0,0,1,0))
plot(g, layout=layout_randomly, main="Random")
plot(g, layout=layout_in_circle, main="Circle")
plot(g, layout=layout_as_star, main="Star")
plot(g, layout=layout_as_tree, main="Tree")
plot(g, layout=layout_on_grid, main="Grid")
plot(g, layout=layout_with_fr, main="Force-directed")
set.seed(3339) # Specify a seed for reproducing network layout
plot(g, layout=layout_with_fr, main="Force-directed")
legend(x=.75, y=.75, legend=c("Dark side", "Light side", "Neutral"),
pch=21, pt.bg=c("red", "gold", "green"), pt.cex=2, bty="n")
Here we've now specified the set.seed(3339) command, which means that the random number generator should produce the same results on your screen.
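If you'd like to convince yourself that the seed is what fixes the arrangement, you can compute the layout coordinates directly instead of plotting them. A minimal sketch, assuming a made-up six-node ring (make_ring) as a stand-in for g:

```r
library(igraph)

toy <- make_ring(6)  # hypothetical stand-in for g: six nodes joined in a circle

set.seed(3339)
coords_a <- layout_with_fr(toy)  # a matrix with one row of x/y coordinates per node

set.seed(3339)
coords_b <- layout_with_fr(toy)

dim(coords_a)
isTRUE(all.equal(coords_a, coords_b))  # checks the two runs produced the same coordinates
```

The layout parameter of plot also accepts a coordinate matrix like coords_a, so another option is to compute a layout you like once, keep it in a variable, and reuse it across plots.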
But: should we just keep trying different numbers until it looks good? That's a really annoying problem, and a reason why there are many options for visualising networks. igraph has a basic option to help with this, the interactive tkplot function, but it can be difficult to set up and at the moment it's more likely to work on Windows and Linux than macOS. There is a workaround for macOS: https://github.com/sethrfore/homebrew-r-srf but now's not the time to try.
set.seed(3339) # Specify a seed for reproducing network layout
tkplot(g)
legend(x=.75, y=.75, legend=c("Dark side", "Light side", "Neutral"),
pch=21, pt.bg=c("red", "gold", "green"), pt.cex=2, bty="n")
There are a number of other options including
ggnetwork: R package for use with ggplot
gephi: a standalone package for interactive arrangements https://gephi.org/
We've covered loading and basic visualisation of network data. To see how it works on data that might be closer to research you're interested in, we're going to run a workshop with another, more real-world dataset from Twitter.
In the folder with the Star Wars data you should see two data files called congress-twitter-network-edges.csv and congress-twitter-network-nodes.csv.
Your task, should you choose to accept it, is to load these datafiles, visualise them, and demonstrate a research question it can answer. May the force plot be with you.
So far, all we've done is look at visualising networks, which is a form of analysis (hence the research questions we considered), but it has many idiosyncrasies and can easily be customised in ways that may be very helpful to one interpretation of results, but that can also be very misleading. Further, network analysis as a whole has often led to a lot of arguments in more rigorous analysis and interpretation.
If any of you saw my taster session, this might be familiar. For the rest: this is a video of a presentation around a very highly cited network analysis study assessing the social contagiousness of behaviour that leads to obesity. Note this involves another dimension we haven't discussed which is unfortunately quite complex in network analysis: change over time.
So it's worth once more bearing in mind that network analysis is hard, and there are issues with methods, application and interpretability. But: keeping that in mind, we shall endeavor.
One of the first and most straightforward components of quantitative assessment is density. Density is the number of connections there are between nodes divided by the number of possible connections. It's often just called network density, but the igraph function is edge_density to highlight that it's specific to edges.
edge_density(g)
## [1] 0.2597403
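That number can be rebuilt by hand. In an undirected network with n nodes (and no links to oneself) there are n(n-1)/2 possible edges, so for g that's 22 × 21 / 2 = 231, and 60 / 231 ≈ 0.2597, matching the output above. The sketch below does the same sum on a hypothetical four-node network (toy is made up) and compares it with edge_density.

```r
library(igraph)

toy <- graph_from_literal(A-B, B-C, C-A, C-D)  # hypothetical network: 4 nodes, 4 edges

n <- vcount(toy)
possible <- n * (n - 1) / 2               # every unordered pair of distinct nodes
manual_density <- ecount(toy) / possible  # 4 edges out of 6 possible

manual_density
edge_density(toy)
```

For a directed network the number of possible edges doubles to n(n-1), since each pair can be connected in either direction.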
Reciprocity is specific to directed networks so let's return to the previously created d (for directed) network for this:
is_directed(d)
## [1] TRUE
reciprocity(d)
## [1] 0
This is a measure of how many edges are reciprocated. It contrasts unidirectional edges, where entity \(a\) is connected to entity \(b\) but not the other way around
\[a \rightarrow b\]
with bidirectional edges, where \(a\) and \(b\) are connected in both directions
\[a \leftrightarrow b\]
The default metric is the proportion of directed edges that have a matching edge in the opposite direction. There's another option, mode="ratio", which counts pairs of nodes instead: the number of mutually connected pairs divided by the number of pairs with a connection in at least one direction, so unconnected pairs of nodes are left out.
reciprocity(d, mode="ratio")
## [1] 0
It's worth being careful about which one you use, and justifying the choice in your analysis.
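Because d has no reciprocated edges, both versions come out as 0, which hides the difference between them. Here's a sketch on a hypothetical three-edge directed network (toy_edges is made up) where A and B point at each other and B also points at C, so the two modes disagree.

```r
library(igraph)

# Hypothetical directed edge list: A -> B, B -> A, B -> C
toy_edges <- data.frame(source = c("A", "B", "B"),
                        target = c("B", "A", "C"))
toy <- graph_from_data_frame(toy_edges, directed = TRUE)

reciprocity(toy)                  # default: 2 of the 3 edges have a partner going back
reciprocity(toy, mode = "ratio")  # ratio: 1 mutual pair out of 2 connected pairs
```

The Star Wars result of 0 makes sense once you notice that each pair of characters appears only once in edges, so no edge can have a partner going the other way.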
A component is a portion of a network in which every node can reach every other node, with no ties to the rest of the network. It is a kind of subgraph, and the term is sometimes used as shorthand for the largest connected component. Components are ways of looking at and assessing separate (because of no ties) parts of a network.
The first thing to consider in these cases is whether the network is connected, meaning that every node can be reached from every other node by following edges.
is_connected(g)
## [1] FALSE
This is applicable irrespective of whether the network is directed. In this case we have one node with no edges so we have two components.
V(g)
## + 22/22 vertices, named, from 9b2b514:
## [1] R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER CAMIE
## [7] BIGGS LEIA BERU OWEN OBI-WAN MOTTI
## [13] TARKIN HAN GREEDO JABBA DODONNA GOLD LEADER
## [19] WEDGE RED LEADER RED TEN GOLD FIVE
components(g)
## $membership
## R2-D2 CHEWBACCA C-3PO LUKE DARTH VADER CAMIE
## 1 1 1 1 1 1
## BIGGS LEIA BERU OWEN OBI-WAN MOTTI
## 1 1 1 1 1 1
## TARKIN HAN GREEDO JABBA DODONNA GOLD LEADER
## 1 1 1 1 1 1
## WEDGE RED LEADER RED TEN GOLD FIVE
## 1 1 1 2
##
## $csize
## [1] 21 1
##
## $no
## [1] 2
$no indicates the number of components, $csize is the size of these components and $membership lists which components each node is in. Note: this isn't a count of components, nodes can only be in 1. GOLD FIVE is in component 2, not 2 components.
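The pieces of the components output can be used to pull out just the largest component, which is a common first step before further analysis. A minimal sketch on a hypothetical four-node network (toy is made up, with D playing the part of GOLD FIVE):

```r
library(igraph)

toy <- graph_from_literal(A-B, B-C, D)  # hypothetical network where D has no edges

is_connected(toy)   # FALSE, because D cannot be reached from the others

parts <- components(toy)
parts$no            # number of components
parts$csize         # size of each component
parts$membership    # which component each node belongs to

# Keep only the nodes in the largest component
biggest <- which.max(parts$csize)
keep <- which(parts$membership == biggest)
main_part <- induced_subgraph(toy, keep)

vcount(main_part)
is_connected(main_part)
```

As with any subgraph, it's worth saying so when results are based on the largest component rather than the whole network.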
Take what we've covered and apply it to the twitter dataset.
Browne, Kath. 2005. “Snowball Sampling: Using Social Networks to Research Non-Heterosexual Women.” International Journal of Social Research Methodology 8 (1): 47–60. doi:10.1080/1364557032000081663.
Christakis, Nicholas A., and James H. Fowler. 2013. “Social Contagion Theory: Examining Dynamic Social Networks and Human Behavior.” Statistics in Medicine 32 (4): 556–77. doi:10.1002/sim.5408.
Cohen-Cole, Ethan, and Jason M. Fletcher. 2008. “Is Obesity Contagious? Social Networks Vs. Environmental Factors in the Obesity Epidemic.” Journal of Health Economics 27 (5): 1382–7. doi:10.1016/j.jhealeco.2008.04.005.
Festinger, Leon. 1954. “Who Shall Survive?” Psychological Bulletin 51 (3). US: American Psychological Association: 322–23. doi:10.1037/h0049443.
Heath, Sue, Alison Fuller, and Brenda Johnston. 2009. “Chasing Shadows: Defining Network Boundaries in Qualitative Social Network Analysis.” Qualitative Research 9 (5). SAGE Publications: 645–61. doi:10.1177/1468794109343631.
Leskovec, Jure, Kevin J. Lang, and Michael Mahoney. 2010. “Empirical Comparison of Algorithms for Network Community Detection.” In Proceedings of the 19th International Conference on World Wide Web, 631–40. WWW ’10. Raleigh, North Carolina, USA: Association for Computing Machinery. doi:10.1145/1772690.1772755.
Mercken, L., T. A. B. Snijders, C. Steglich, E. Vartiainen, and H. de Vries. 2010. “Dynamics of Adolescent Friendship Networks and Smoking Behavior.” Social Networks 32 (1): 72–81. doi:10.1016/j.socnet.2009.02.005.
Murthy, Dhiraj. 2012. “Towards a Sociological Understanding of Social Media: Theorizing Twitter.” Sociology 46 (6). SAGE Publications Ltd: 1059–73. doi:10.1177/0038038511422553.
Noel, Hans, and Brendan Nyhan. 2011. “The ‘Unfriending’ Problem: The Consequences of Homophily in Friendship Retention for Causal Estimates of Social Influence.” Social Networks 33 (3): 211–18. doi:10.1016/j.socnet.2011.05.003.
Shalizi, Cosma Rohilla, and Andrew C. Thomas. 2011. “Homophily and Contagion Are Generically Confounded in Observational Social Network Studies.” Sociological Methods & Research 40 (2): 211–39. doi:10.1177/0049124111404820.
Tero, Atsushi, Seiji Takagi, Tetsu Saigusa, Kentaro Ito, Dan P. Bebber, Mark D. Fricker, Kenji Yumiki, Ryo Kobayashi, and Toshiyuki Nakagaki. 2010. “Rules for Biologically Inspired Adaptive Network Design.” Science 327 (5964). American Association for the Advancement of Science: 439–42. doi:10.1126/science.1177894.
Even if you're in a broader *nix category there are options...
See https://statmodeling.stat.columbia.edu/2011/06/10/christakis-fowl/ for some summary
Full disclosure: one of my examiners was a co-author on one of these papers...
1 (Social) Network Data
Today we're going to demonstrate simple ways of loading network data, visualisation, analysing structure, and showing how this can help answer research questions.
So a recurring point I'll make today is that there are lots of systems that can be represented as a network. Perhaps you're interested in social networks like twitter (Murthy 2012), but there are other types of networks that could be of interest (Tero et al. 2010):
1.1 What is a Network?
A network is a way of representing how things are connected (or not). They can be social (who tweets to whom), economic (which companies employ which people), engineering (which parts were used in each product) etc. The representation of these connections is a network, also called a graph in mathematics (confusing eh?), and in the social sciences it's a tool often applied to help quantitatively (and sometimes qualitatively) answer research questions (Heath, Fuller, and Johnston 2009).
1.2 Installing R and igraph
We're going to focus on the igraph R package today. There are many other options but igraph is a fairly comprehensive package for getting started. If you're looking to go beyond what we cover today, I recommend looking through the igraph documentation for your particular interests before trying the other packages, and feel free to email me at griffith.rees@sheffield.ac.uk if you've got questions. Please ask detailed questions, demonstrating what you're trying to do and what you've tried so far, so I can reply efficiently.
For those that have got igraph installed, feel free to continue further down this handout to play around with the data.
Everyone else: please download RStudio from https://rstudio.com/products/rstudio/download/#download.
It should fit your operating system automatically (Windows, Linux or Mac). If you're not seeing an option (possibly old versions of Windows, macOS, or unusual Linux distributions) message me.1
Please fill in the survey when you're done or post messages if you're having issues.
Now let's install igraph (https://igraph.org/r/)
and then load it in your R session.
1.3 Test loading igraph data
We're going to jump into visualisation with igraph as a test of the install and a demonstration of visualisation options. Then we'll break it down into what's going on underneath but again, feel free to mess around with what's loaded as we continue on.
We begin by loading node and edge data from Dr Evelina Gabašová's excellent dataset on which characters appeared together in scenes of Star Wars films. Shamelessly borrowing from an NYU short course created by Dr Pablo Barberá, we focus on Episode IV - A New Hope.
You can have a look at my github repository for this course: https://github.com/griff-rees/network-analysis-course and download the repository. That includes the code for this handout and the data we're playing around with today.
Once you've downloaded and unzipped it, have a look in the data folder to make sure there are 4 csv files, including star-wars-network-edges.csv and star-wars-network-nodes.csv
The originals can be found at https://github.com/pablobarbera/data-science-workshop/tree/master/sna/data.
1.3.1 Load Node/Edge CSVs
First load the csv of nodes and have it print out the list of records.
You should see two columns: list of character names from the film alongside a list of id numbers. We'll look at this in detail later.
Next load a list of edges:
Again we'll go into this in further detail but for now it's just good to know everyone's got these up and working. You should see three columns: source, target and weight.
There are many ways of storing network information, and we'll look at other options later. For now: note the nodes are the names of characters and the edges are pairs of characters followed by the number of scenes they share in the film.
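For anyone who wants to try the loading step without the course files to hand, here is a minimal sketch. It assumes the files are read with base R's read.csv (an assumption on my part about the loading code), and it writes a tiny stand-in for star-wars-network-edges.csv to a temporary file first, using two rows copied from the head(edges) output shown in this handout.

```r
# Write a tiny stand-in for star-wars-network-edges.csv to a temporary file
# (two rows copied from the head(edges) output; the real file has many more)
edges_path <- tempfile(fileext = ".csv")
writeLines(c("source,target,weight",
             "C-3PO,R2-D2,17",
             "LUKE,R2-D2,13"), edges_path)

# stringsAsFactors = FALSE keeps the names as plain text on older versions of R
toy_edges <- read.csv(edges_path, stringsAsFactors = FALSE)

names(toy_edges)  # the column names
nrow(toy_edges)   # the number of edges loaded
toy_edges
```

With the real data you'd swap edges_path for the path to the csv in the data folder, and load the nodes file the same way.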
1.3.2 Creating a network
We'll look at how this works in detail in the next section, but note here that we load the data.frame of edges into the d parameter and the data.frame of nodes (in this case characters) as vertices (another name for nodes in network analysis). We're setting directed=FALSE for simplicity (more on this later).
1.3.3 Visualising a network
There are lots of ways to visualise networks. In part as a test of setup, we're jumping in to demonstrate how visualisation works, make sure everything's installed correctly and ensure the data is loading correctly as well.
Mostly illegible eh? Your plot will have a different layout but as long as it looks similar we should be ready to go.
Even with this rough layout we can, however, answer a research question:
Which character shares the least number of scenes with any other?
Sneak preview of what's to come: